Introduction - Why Do This Analysis?

One of the most useful unsupervised machine learning techniques is clustering, which groups items based on how similar they are. One natural application is in higher education. For example, consider two schools in my home state of Ohio: should we think of the College of Wooster as comparable to the University of Dayton? Both are private colleges that enroll fewer than 10,000 students, have a healthy share of out-of-state students, and charge relatively high (>$40,000) tuition. When we’re classifying universities, should they be in the same group?

For this analysis, I’ll attempt to classify schools based on their selectivity. ‘Selectivity’ is a vague concept, but a handful of meaningful statistics capture it through university prestige and academic rigor: tuition rates, admission rates, ACT scores of incoming students, and total enrollment.

Data Gathering & Exploration

For this analysis, we will make use of the IPEDS (Integrated Postsecondary Education Data System) database, a government source for higher education data. The data used here comes from the 2021 collection.

IPEDS publishes a Microsoft Access database every year with hundreds of variables covering thousands of higher education institutions. To simplify things, I exported the relevant tables from Access to Excel, which streamlines the analysis and avoids working directly with the bulky Access database.

Load packages and read in the data

For this analysis, we need the tidyverse as well as the readxl package to load the Excel data.

library(tidyverse)
library(readxl)

# Read data
institution <- read_excel("data\\HD2021.xlsx")
adm <- read_excel("data\\ADM2021.xlsx")
adm_yield <- read_excel("data\\DRVADM2021.xlsx")
enrollment <- read_excel("data\\DRVEF2021.xlsx")
enrollment_d <- read_excel("data\\EF2021D.xlsx")
tuition <- read_excel("data\\IC2021_AY.xlsx")
Build the data frame

We will require data across multiple categories. IPEDS stores each variable under a mnemonic, so I translate each mnemonic into a more informative variable name as the tables are prepared.

#Institutional characteristics

institution <- institution %>% 
  select(UNITID, INSTNM, ICLEVEL, CONTROL,  LATITUDE, LONGITUD, ADDR, CITY, STABBR, ZIP, CBSA) %>% 
  rename(school = INSTNM)

#Admissions
adm <- adm %>% 
  select(UNITID, ACTCM75) %>% 
  rename(act_score = ACTCM75)

#Derived admissions table
adm_yield <- adm_yield %>% 
  select(UNITID, DVADM01) %>% 
  rename(pct_admitted = DVADM01)

#Enrollment characteristics
enrollment <- enrollment %>% 
  select(UNITID, EFUG, EFUGFT, EFUGPT, PCUDEEXC, RMINSTTP, RMOUSTTP, RMFRGNCP) %>% 
  rename(ug_enroll = EFUG,
         ft_enroll = EFUGFT,
         pt_enroll = EFUGPT,
         pct_online = PCUDEEXC,
         pct_instate = RMINSTTP,
         pct_outstate = RMOUSTTP,
         pct_foreign = RMFRGNCP)
enrollment_d <- enrollment_d %>%
  select(UNITID, RET_PCF) %>% 
  rename(retention_rate = RET_PCF)
enrollment <- left_join(enrollment, enrollment_d, by = c("UNITID" = "UNITID"))

#Tuition
tuition <- tuition %>% 
  select(UNITID, TUITION2, TUITION3) %>% 
  rename(in_state_tuition = TUITION2,
         out_state_tuition = TUITION3)

Now we can join all this data together into a main data frame and filter the result to only include 4-year, non-profit schools (i.e., exclude 2-year/technical colleges and for-profit institutions, which are outside the scope of this analysis).

###Join it all together###
school_data <- institution %>% 
  left_join(y = adm, by = c("UNITID" = "UNITID")) %>% 
  left_join(y = adm_yield, by = c("UNITID" = "UNITID")) %>% 
  left_join(y = enrollment, by = c("UNITID" = "UNITID")) %>% 
  left_join(y = tuition, by = c("UNITID" = "UNITID")) 
#And limit the analysis to 4 year private & public non-profit institutions
school_data <- school_data %>% 
  filter(ICLEVEL == 1,
         CONTROL != 3) %>% 
  select(-ICLEVEL)

#Let's create a key linking each school ID to a school, for later
school_id_key <- school_data %>% 
  select(UNITID, school)

Data Cleaning

We now have a primary data frame. Let’s take a look at it.

school_data <- school_data %>% 
  mutate(UNITID = as.character(UNITID))

summary(school_data)
##     UNITID             school             CONTROL         LATITUDE     
##  Length:2487        Length:2487        Min.   :1.000   Min.   :-14.32  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.: 34.43  
##  Mode  :character   Mode  :character   Median :2.000   Median : 39.60  
##                                        Mean   :1.667   Mean   : 37.97  
##                                        3rd Qu.:2.000   3rd Qu.: 41.78  
##                                        Max.   :2.000   Max.   : 71.32  
##                                                                        
##     LONGITUD           ADDR               CITY              STABBR         
##  Min.   :-170.74   Length:2487        Length:2487        Length:2487       
##  1st Qu.: -96.59   Class :character   Class :character   Class :character  
##  Median : -84.88   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : -88.82                                                           
##  3rd Qu.: -76.70                                                           
##  Max.   : 171.38                                                           
##                                                                            
##      ZIP                 CBSA         act_score      pct_admitted   
##  Length:2487        Min.   :   -2   Min.   :15.00   Min.   :  0.00  
##  Class :character   1st Qu.:18520   1st Qu.:24.00   1st Qu.: 63.00  
##  Mode  :character   Median :31080   Median :27.00   Median : 78.00  
##                     Mean   :29096   Mean   :27.01   Mean   : 72.81  
##                     3rd Qu.:38980   3rd Qu.:30.00   3rd Qu.: 89.00  
##                     Max.   :49780   Max.   :36.00   Max.   :100.00  
##                                     NA's   :1501    NA's   :835     
##    ug_enroll        ft_enroll        pt_enroll       pct_online    
##  Min.   :     0   Min.   :     0   Min.   :    0   Min.   :  0.00  
##  1st Qu.:   331   1st Qu.:   216   1st Qu.:   20   1st Qu.:  0.00  
##  Median :  1488   Median :  1132   Median :  155   Median :  7.00  
##  Mean   :  4302   Mean   :  3163   Mean   : 1139   Mean   : 17.16  
##  3rd Qu.:  4219   3rd Qu.:  2966   3rd Qu.:  826   3rd Qu.: 23.00  
##  Max.   :121884   Max.   :107952   Max.   :85914   Max.   :100.00  
##  NA's   :82       NA's   :82       NA's   :82      NA's   :329     
##   pct_instate      pct_outstate     pct_foreign     retention_rate  
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 52.00   1st Qu.:  7.00   1st Qu.: 0.000   1st Qu.: 65.00  
##  Median : 77.00   Median : 20.00   Median : 1.000   Median : 74.00  
##  Mean   : 69.46   Mean   : 26.73   Mean   : 3.036   Mean   : 73.18  
##  3rd Qu.: 91.00   3rd Qu.: 42.00   3rd Qu.: 4.000   3rd Qu.: 83.00  
##  Max.   :100.00   Max.   :100.00   Max.   :67.000   Max.   :100.00  
##  NA's   :1317     NA's   :1317     NA's   :1317     NA's   :612     
##  in_state_tuition out_state_tuition
##  Min.   :    0    Min.   :    0    
##  1st Qu.: 7070    1st Qu.:12000    
##  Median :13603    Median :20304    
##  Mean   :20566    Mean   :24077    
##  3rd Qu.:32800    3rd Qu.:33681    
##  Max.   :77200    Max.   :77200    
##  NA's   :342      NA's   :342

Now we can do some light feature engineering. One new feature is an ‘average’ tuition rate for all students, weighting in-state and out-of-state tuition by each school’s share of in-state and out-of-state enrollment.

school_data <- school_data %>% 
  mutate(in_state_share = pct_instate/ (pct_instate + pct_outstate),
         out_state_share = 1-in_state_share,
         avg_tuition = (in_state_share * in_state_tuition) + (out_state_share * out_state_tuition)
         ) %>% 
  select(-in_state_share, -out_state_share)

summary(school_data$avg_tuition)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    7430   14700   21028   33231   62420    1323

One thing I noticed in the first exploration is that the ACT score and percent admitted fields have the most NAs. We can use the plot_histogram() function from the DataExplorer package to get a better sense of the missing data here.

## Warning: package 'DataExplorer' was built under R version 4.2.3
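The plotting code itself isn’t shown above; here is a minimal sketch of the idea, assuming we want to see how schools with missing ACT scores are distributed on other metrics (the variable selection is illustrative, not the original call):

```r
library(DataExplorer)

# Histograms of key metrics for schools missing an ACT score, to check
# whether those schools share an obvious profile
school_data %>%
  filter(is.na(act_score)) %>%
  select(ug_enroll, pct_admitted, avg_tuition) %>%
  plot_histogram()
```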

There doesn’t seem to be a unifying feature that links the schools with missing ACT scores. Most such schools are smaller, but some larger schools are affected as well. Accordingly, we can impute the missing ACT scores with the mean value in our data, so we aren’t losing all of these observations.

school_data <- school_data %>% 
  mutate(act_score = ifelse(is.na(act_score), mean(act_score, na.rm = TRUE), act_score))

Finally, k-means can only handle numeric data, so we will remove any categorical features and drop the remaining rows with missing values.

school_full <- school_data

school_data <- school_data %>% 
  drop_na()

modeled_data <- school_data %>% 
  column_to_rownames(var = 'UNITID') %>% 
  select(-ft_enroll, -pt_enroll, -pct_online, -pct_foreign, -retention_rate, -in_state_tuition, -out_state_tuition, -school,
         -CITY, -STABBR, -ZIP, -CBSA, -LATITUDE, -LONGITUD, -ADDR, -CONTROL) #These variables help explain the results but don't themselves define 'selectivity' in our analysis

plot_histogram(modeled_data)

summary(modeled_data)
##    act_score      pct_admitted      ug_enroll       pct_instate    
##  Min.   :15.00   Min.   :  4.00   Min.   :    14   Min.   :  0.00  
##  1st Qu.:26.00   1st Qu.: 66.00   1st Qu.:  1474   1st Qu.: 49.75  
##  Median :27.01   Median : 79.00   Median :  2894   Median : 73.00  
##  Mean   :27.14   Mean   : 73.76   Mean   :  6732   Mean   : 66.54  
##  3rd Qu.:29.00   3rd Qu.: 89.00   3rd Qu.:  8013   3rd Qu.: 88.00  
##  Max.   :36.00   Max.   :100.00   Max.   :121884   Max.   :100.00  
##   pct_outstate     avg_tuition   
##  Min.   :  0.00   Min.   : 1156  
##  1st Qu.: 10.00   1st Qu.: 9395  
##  Median : 24.00   Median :19129  
##  Mean   : 29.43   Mean   :24253  
##  3rd Qu.: 45.00   3rd Qu.:36263  
##  Max.   :100.00   Max.   :62420

Now that we’ve cleaned up the data, we can look at the fields that will go into our k-means analysis.

colnames(modeled_data)
## [1] "act_score"    "pct_admitted" "ug_enroll"    "pct_instate"  "pct_outstate"
## [6] "avg_tuition"

K-Means Analysis

We have a nice, tidy dataset, so we can start analyzing! K-means requires that all features be on the same scale, so that fields with large values or wide ranges don’t have an outsized influence on the distance calculations. The scale() function handles this easily.

modeled_data <- modeled_data %>% 
  scale()

summary(modeled_data)
##    act_score         pct_admitted       ug_enroll        pct_instate     
##  Min.   :-3.70573   Min.   :-3.3577   Min.   :-0.6972   Min.   :-2.6124  
##  1st Qu.:-0.34814   1st Qu.:-0.3734   1st Qu.:-0.5457   1st Qu.:-0.6591  
##  Median :-0.03981   Median : 0.2523   Median :-0.3983   Median : 0.2538  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.56756   3rd Qu.: 0.7336   3rd Qu.: 0.1329   3rd Qu.: 0.8428  
##  Max.   : 2.70421   Max.   : 1.2631   Max.   :11.9499   Max.   : 1.3139  
##   pct_outstate      avg_tuition     
##  Min.   :-1.2651   Min.   :-1.4043  
##  1st Qu.:-0.8352   1st Qu.:-0.9033  
##  Median :-0.2333   Median :-0.3115  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6695   3rd Qu.: 0.7302  
##  Max.   : 3.0339   Max.   : 2.3205

Choosing the right value for ‘k’

While the ‘optimal’ number of clusters is ultimately a judgment call, three quantitative measures can guide the choice:

  • The ‘elbow’ method calculates the total within-cluster sum of squares for various values of k; the ‘elbow point’ is the optimal number of clusters, since beyond it increasing k no longer yields a significant improvement.
  • The silhouette method calculates how closely each item matches the other items in its own cluster versus items in other clusters; a high average silhouette width is optimal.
  • The gap statistic compares the within-cluster dispersion in the observed dataset to that of randomly-generated reference datasets.

These three methods are ultimately suggestions, and it’s up to the analyst to choose the final k.
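To make the ‘elbow’ computation concrete, here is a minimal base-R sketch of what fviz_nbclust(method = "wss") automates, run on synthetic stand-in data (not the school data):

```r
# Total within-cluster sum of squares for k = 1..10 on synthetic data
# (a stand-in for modeled_data; base R only)
set.seed(1234)
example_data <- scale(matrix(rnorm(600), ncol = 6))
wss <- sapply(1:10, function(k) kmeans(example_data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

The ‘elbow’ is the k after which the curve flattens out.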

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.3
###Choose the right k
#Elbow method
fviz_nbclust(modeled_data, kmeans, method = "wss") #This suggests that 2 or 4 is the optimal k

#Average silhouette method
fviz_nbclust(modeled_data, kmeans, method = "silhouette") #This suggests optimal k is 2

fviz_nbclust(modeled_data, kmeans, method = "gap_stat") #This suggests 10 is the optimal k

Two of the three methods point to 2 clusters. While there is no single ‘right’ answer, intuitively it makes sense to have somewhere between 3 and 6 groups for easy interpretation.

Performing clustering

set.seed(1234)
k_2 <- kmeans(modeled_data, centers = 2, nstart = 50)
k_4 <- kmeans(modeled_data, centers = 4, nstart = 50)

k_2$size
## [1] 631 293
k_4$size
## [1]  93 435 319  77
#Assign each school to a cluster
school_data <- school_data %>% 
  mutate(cluster_2 = k_2$cluster) %>% 
  mutate(cluster_4 = k_4$cluster)

Analyzing results

We can start by simply summarizing each of the resulting groups under both cluster solutions.

school_data %>% 
  select(cluster_2, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_2) %>% 
  summarize_all('mean')
## # A tibble: 2 × 8
##   cluster_2 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
##       <int>   <dbl>     <dbl>        <dbl>     <dbl>      <dbl>       <dbl>
## 1         1    1.39      26.2         79.7     7399.      13.5         80.7
## 2         2    1.89      29.1         61.0     5297.       7.78        35.9
## # ℹ 1 more variable: avg_tuition <dbl>
school_data %>% 
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_4) %>% 
  summarize_all('mean')
## # A tibble: 4 × 8
##   cluster_4 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
##       <int>   <dbl>     <dbl>        <dbl>     <dbl>      <dbl>       <dbl>
## 1         1    1.96      32.4         34.9     5452.       2.46        24.2
## 2         2    1.30      25.7         81.4     5569.      15.6         85.5
## 3         3    1.89      27.1         76.0     2728.       8.45        51.3
## 4         4    1.03      29.1         68.4    31441.      13.8         73.9
## # ℹ 1 more variable: avg_tuition <dbl>

We can also visualize the results. Note that this chart reduces the data to two dimensions via Principal Component Analysis, so the axes are principal components rather than our original features, which makes the chart less directly interpretable… but it nonetheless gives a good sense of the rough shapes and overlaps among our clusters.
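As a sanity check on that 2-D view, we can look at how much of the total variance the first two principal components actually capture (running prcomp() on the already-scaled modeled_data; the exact proportion isn’t reported in the analysis above):

```r
# Share of variance captured by the first two principal components,
# i.e., how faithful the 2-D cluster plot is to the full 6-D data
pca <- prcomp(modeled_data)
summary(pca)$importance["Cumulative Proportion", 1:2]
```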

fviz_cluster(k_2,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 2 Clusters")

fviz_cluster(k_4,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 4 Clusters")

To me, two groups don’t seem descriptive enough, so sticking with four groups feels right.

Let’s see the summary table on the four-group split again.

library(kableExtra)
school_data %>% 
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_4) %>% 
  summarize_all('mean') %>%
  kable(digits=2, format = 'markdown',row.names = TRUE) %>% 
  kable_styling(full_width = T,
                font_size = 12) %>% 
  row_spec(row = 0, color = '#660033')
|   | cluster_4 | CONTROL | act_score | pct_admitted | ug_enroll | pct_online | pct_instate | avg_tuition |
|---|-----------|---------|-----------|--------------|-----------|------------|-------------|-------------|
| 1 | 1 | 1.96 | 32.36 | 34.90 | 5451.74 | 2.46 | 24.15 | 52479.63 |
| 2 | 2 | 1.30 | 25.71 | 81.37 | 5569.05 | 15.64 | 85.48 | 13954.39 |
| 3 | 3 | 1.89 | 27.10 | 76.01 | 2728.06 | 8.45 | 51.29 | 32747.36 |
| 4 | 4 | 1.03 | 29.09 | 68.35 | 31440.71 | 13.84 | 73.88 | 13146.53 |

Evaluating the Clusters

  • Group 1 is the ‘elites’. This group stands out for its ability to charge a high average tuition – over $50,000 on average! – and it accepts the lowest percentage of applicants.
  • Group 2 is the less selective category. These schools have the lowest entrance test scores and the highest share of students coming from in-state. Although not every school in this group is public, schools in this group also charge a low average tuition.
  • Group 3 is the ‘small privates’. Although this group has, in general, academic requirements similar to Groups 2 and 4, its schools charge over twice as much. These schools are small on average, and they do a better job of attracting students from out of state.
  • Group 4 is the ‘large publics’ bucket - nearly every school in this group is public. The average tuition in this group is quite low, around $13,000 on average. Based on admissions standards, this group is moderately selective.

It’s interesting that even though public/private control was not one of the features used in the clustering, the four groups split relatively cleanly between public and private schools.
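A quick cross-tab makes that claim checkable (in IPEDS, CONTROL = 1 is public and CONTROL = 2 is private non-profit; the exact counts aren’t shown in the analysis above):

```r
# Count public (1) vs. private non-profit (2) schools in each cluster
table(control = school_data$CONTROL, cluster = school_data$cluster_4)
```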

Now that we have a sense of what makes up our data, we can give our clusters some more informative names.

summary_table <- school_data %>% 
    select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition, retention_rate) %>% 
  group_by(cluster_4) %>% 
  summarize_all('mean')

summary_table <- summary_table %>% 
  mutate(cluster_4 = ifelse(cluster_4=='1', 'Elites',
                            ifelse(cluster_4=='2', 'Less Selective',
                                   ifelse(cluster_4=='3', 'Small Privates', 'Large Publics')))) %>% 
  rename(cluster = cluster_4,
         control = CONTROL)

kable(summary_table, 'html', digits = c(0,2,2,2,0,1,1,0,2), align = 'lcccccccc') %>% 
  kable_styling(full_width = F)
| cluster | control | act_score | pct_admitted | ug_enroll | pct_online | pct_instate | avg_tuition | retention_rate |
|---------|---------|-----------|--------------|-----------|------------|-------------|-------------|----------------|
| Elites | 1.96 | 32.36 | 34.90 | 5452 | 2.5 | 24.2 | 52480 | 90.65 |
| Less Selective | 1.30 | 25.71 | 81.37 | 5569 | 15.6 | 85.5 | 13954 | 70.41 |
| Small Privates | 1.89 | 27.10 | 76.01 | 2728 | 8.5 | 51.3 | 32747 | 73.92 |
| Large Publics | 1.03 | 29.09 | 68.35 | 31441 | 13.8 | 73.9 | 13147 | 86.64 |

Let’s drive this home with some visuals.

school_data <- school_data %>% 
  select(-cluster_2) %>% 
  rename(cluster= cluster_4) %>% 
  mutate(cluster = ifelse(cluster=='1', 'Elites',
                            ifelse(cluster=='2', 'Less Selective',
                                   ifelse(cluster=='3', 'Small Privates', 'Large Publics')))) %>% 
  mutate(cluster = as.factor(cluster))
library(RColorBrewer) #for pretty colors
#Average selectivity metrics by cluster
school_data %>% 
  select(cluster, act_score, pct_admitted, avg_tuition, ug_enroll) %>% 
  group_by(cluster) %>% 
  summarize_all('mean') %>% 
  pivot_longer(cols = -cluster) %>% 
  ggplot() +
  geom_col(aes(x = cluster, y = value, fill = cluster)) +
  facet_wrap(~name, scales = 'free') + 
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

#Scatterplot - tuition and act scores
# my_plot <- school_data %>% 
#   rename(undergrad_enrollment = ug_enroll) %>% 
#   ggplot() +
#   geom_point(aes(x = act_score, y = avg_tuition, color = cluster, size = undergrad_enrollment), alpha = 0.5) +
#   scale_color_brewer(palette = 'Dark2') +
#   theme_minimal() +
#   labs(x = 'ACT Score (Composite, 75th Percentile)',
#        y = 'Average Tuition Rate',
#        title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')

#ggplotly(my_plot)  # Convert to Plotly object

library(plotly)
my_plot <- plot_ly(data = school_data, 
        x = ~act_score, y = ~avg_tuition, type = "scatter", mode = "markers",
        color = ~cluster,
        size = ~ug_enroll,
        hoverinfo = "text", text = ~paste("School: ", school, "<br> Undergrad Enrollment: ", ug_enroll)) %>%
  plotly::layout(title = "Selectivity Metrics for Schools Based on Cluster",
         xaxis = list(title = "ACT Score (Composite, 75th Percentile)"),
         yaxis = list(title = "Average Tuition Rate"))

my_plot
## Warning: `line.width` does not currently support multiple values.

Wrap-Up

So there you have it. Even though individual schools can have similar values on various selectivity measures, k-means helps us tell whether one school really is similar to another, based not just on one or two metrics but on the sum total of all the metrics we deem important.

A good extension of this analysis would be to use these groupings in a predictive model, building a separate model for each cluster of schools. This is potentially useful since each model’s coefficients need only describe a subset of the data rather than all of it, which could ultimately yield more accurate predictions.
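As a rough sketch of that idea, one could fit a separate model per cluster with dplyr’s group_map(); the outcome and predictors below are illustrative assumptions, not part of the analysis above:

```r
# Fit one linear model per cluster (illustrative: predicting retention_rate
# from two selectivity metrics within each cluster)
per_cluster_models <- school_data %>%
  group_by(cluster) %>%
  group_map(~ lm(retention_rate ~ act_score + avg_tuition, data = .x))

summary(per_cluster_models[[1]])  # inspect the model for the first cluster
```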

And what about those schools I brought up in the beginning, College of Wooster and Dayton? Turns out they do belong to different groups in our analysis, with Dayton in the ‘Small Privates’ group and Wooster in the ‘Elites’ group.

library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.2.3
school_data %>% 
  rename(undergrad_enrollment = ug_enroll) %>% 
  ggplot(aes(x = act_score, y = avg_tuition)) +
  geom_point(aes(color = cluster, size = undergrad_enrollment), alpha = 0.5) +
  geom_text_repel(aes(label=ifelse(school == 'The College of Wooster' | school == 'University of Dayton', school, '')), nudge_x = 2, nudge_y = -4500, max.overlaps = 2000) +
  scale_color_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(x = 'ACT Score (Composite, 75th Percentile)',
       y = 'Average Tuition Rate',
       title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')